For the final project, I really wanted to use data available to me from my job. However, the few datasets that initially looked promising were all in relational format, and denormalizing them into a single flat table would have taken significant effort.

So I went with the white wine data set.

Initial Exploration

The first thing I did, of course, was to load the data set into RStudio. I looked at the names of the columns and summary data for each column:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

From this, it appears that ‘X’ is simply a unique identifier for the wine. The other 12 variables appear to be continuous measurements of the chemical properties of each wine variety. Quick observations can be made from the summary output. The mean and median of fixed acidity show that most wines are acidic (below a pH of 7.0). At least one wine is extraordinarily basic, with a pH of 14.2. In fact, my knowledge of chemistry is that pH has a maximum value of 14, so this measurement is either erroneous, or isn’t a true pH value, or perhaps my understanding is wrong.

Volatile acidity and Citric Acid measurements also have mean and median values well below the maximum value. So I expect to see at least one outlier in each of these measurements. It will be interesting to see if the same wine is responsible for the outlier in each of these attributes.

It turns out that this is where it’s helpful to read the documentation. According to the text file that accompanies the data set, these three columns are not pH values, but are rather measurements of tartaric acid, acetic acid, and citric acid in grams per cubic decimeter. My college background is in laboratory medicine, so this unit seems odd to me. Google Calculator tells me that one gram per cubic decimeter is the same as one milligram per milliliter, which helps give me some context.

Therefore, my observations above should be reinterpreted to mean that at least one wine has an exceedingly high acid content in each of these three categories. It remains to be seen if the same wine is responsible for the outlier for all measurements, or if different wines are responsible for the outlier(s) in each case.
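
The question of whether a single wine holds the maximum in all three acidity columns can be checked directly. The following is only a minimal sketch: `toy` is a hypothetical stand-in for the wine data frame, and the placement of the extreme values in its rows is invented for illustration.

```r
# Hypothetical stand-in for the wine data frame; row placement is invented.
toy <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7.0, 6.3, 14.2, 7.2, 6.8),
  volatile.acidity = c(0.27, 1.10, 0.28, 0.23, 0.30),
  citric.acid      = c(0.36, 0.34, 0.40, 1.66, 0.32)
)

acid_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid")

# Row index of the maximum value in each acidity column.
max_rows <- sapply(toy[acid_cols], which.max)
print(max_rows)

# TRUE only if one row holds the maximum of all three columns.
same_wine <- length(unique(max_rows)) == 1
print(same_wine)
```

Run against the real data frame, the same two steps would show whether the maxima come from one wine or from several.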

Most of the variables have similar outliers. The exceptions seem to be density (which makes sense; wine should have a density close to that of water), and pH, which appears to be in a relatively narrow range. However, pH is a logarithmic scale so my initial interpretation might need to be examined more closely.

Quality, the output variable, is not well documented. I’m going to assume that this is a subjective score of the wine, hopefully one generated by a judge or judges that are trained in wine tasting. Given the range of values, I’m guessing wines were scored on a scale of 1 to 10 (or 0 to 10). The difference between the 1st and 3rd quartiles is very small; most wines are going to have a score of 5 or 6. That does make that maximum value more interesting: if most wines score a 5 or 6, I want to taste that wine that scored a 9!

It appears that there are no categorical variables in the dataset. Let’s check:

str(pf)
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

We see that all of the input data is continuous data, and the output is an integer. That makes sense to me given the documentation on the dataset. (I suppose an integer is actually an ordered categorical variable.)

Initial Exploration - Single Variable

There are eleven input variables. While I could do the same single-variable analysis on each of these eleven inputs, I don’t think that’s likely to be a productive application of the course material. When I look at the variables, I note that there are several measurements related to acidity: fixed acidity, volatile acidity, citric acid, and pH all jump out as similar measures. Later, I’ll probably want to explore relationships between these variables. But for now, I’m going to look a little more at each of these variables in isolation. I plotted histograms of each of these variables:

qplot(x = fixed.acidity, data = pf, binwidth = 0.5)

qplot(x = volatile.acidity, data = pf, binwidth = 0.1)

qplot(x = citric.acid, data = pf, binwidth = 0.05)

qplot(x = pH, data = pf, binwidth = 0.025)
## Warning: position_stack requires constant width: output may be incorrect

As the summary suggested, we see outliers on the top end of fixed acidity, volatile acidity, and citric acid. Even pH appears to have outliers at the upper end of the curve. However, an outlier at the upper end of the pH scale means a value is LESS acidic, where the other graphs get MORE acidic as we move to the right of the graph. In any case, it will be interesting to see later if the same wine(s) are responsible for the outliers on all the graphs, or if different wines are responsible for the individual variable outliers.

Without knowing how this data was measured and collected, it’s impossible to know if the outliers are valid data or if they were inaccurate measurements. With no evidence to the contrary, I think that we have to treat all data as valid.

I see that fixed acidity appears to be a more-or-less Normal distribution. Volatile acidity has a longer tail to the right than it does to the left, but still has a unimodal distribution. Citric acid has a few outliers that are WAY down the graph but is similar to the distribution of volatile acidity. Lastly, pH looks like a Normal distribution, but with an odd stair-stepping pattern. I suspect this is an artifact of the resolution of the equipment used to take the measurements; it’s hard to believe that such a pattern would occur naturally in such a large sample of wines. I found this plot as I was playing with binwidths and included it as an example of what a ‘wrong’ application of visualization might look like.
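
One plausible contributor to the stair-stepping is the interaction between the binwidth and the recording resolution. Assuming pH is recorded to two decimal places (as the values in the `str` output suggest), a binwidth of 0.025 cannot hold the same number of possible readings in every bin. The sketch below counts possible readings per bin; it is an illustration of the arithmetic, not a reproduction of how ggplot2 places its bin boundaries.

```r
# All possible readings from 3.00 to 3.49 at a resolution of 0.01,
# stored as integer hundredths to avoid floating-point edge effects.
readings <- 300:349

# Number of distinct possible readings that land in each bin.
per_bin_0025 <- table(floor(readings / 2.5))  # binwidth 0.025
per_bin_0050 <- table(floor(readings / 5))    # binwidth 0.05

print(as.vector(per_bin_0025))  # alternates between 3 and 2
print(as.vector(per_bin_0050))  # 5 in every bin
```

Under this assumption, a binwidth that is a whole multiple of the resolution (0.05, for example) gives every bin the same number of possible readings and would be expected to smooth out the steps.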

Two-Variable Analysis

The first two-variable relationship that I wanted to explore is the relationship between each of the acidity measurements and pH. I think I’m likely to find some interesting relationships there. I don’t understand the relationships that these individual measures have, but I do know that pH is a well understood measurement of the acidity of a liquid. I wanted to know if any of the individual acidity measurements are strongly correlated with pH. To do this, I created scatter plots of fixed acidity, volatile acidity, and citric acid and plotted them against pH:

ggplot(aes(x=pH, y=fixed.acidity), data=pf) + geom_point()

ggplot(aes(x=pH, y=volatile.acidity), data=pf) + geom_point()

ggplot(aes(x=pH, y=citric.acid), data=pf) + geom_point()

I don’t see a striking relationship between any of these variables and pH as I had hoped. However, there are two observations to be made. The first is that fixed acidity does appear to have a negative, possibly linear, correlation with pH. The points are scattered pretty widely around the best fit line, but I think an argument can be made that this relationship probably exists by looking at the plot. (Adding a basic knowledge of chemistry to looking at the plot makes this argument stronger.) Neither volatile acidity nor citric acid appears to have an observable correlation with pH.

The second thing I notice in these plots is an odd horizontal line at citric acid = 0.75, and another at 0.5. There are enough wines with citric acid just above 0.75 and just below 0.75 that I don’t think this is likely to be a measurement error. If we were looking at a manufactured product, I’d suspect that perhaps there might be a legal or regulatory or market-driven force that would create a boundary like this. However, wines are the product of fermentation rather than direct manipulation of ingredients, so I don’t think that’s the case here. Without further data, I have no explanation for why there appears to be such a distinct cutoff for this measurement.

I decided to overlay the best fit line on the fixed acidity vs. pH plot, as a refresher for myself and to prepare myself for creating more complicated plots:

ggplot(aes(x=pH, y=fixed.acidity), data=pf) + geom_point() + stat_smooth(method="lm", se=FALSE)

This plot doesn’t really tell me anything that I didn’t already discover in the previous plot, but it can help to call attention to the relationship between the two variables.

Because Quality is the ‘output variable’ in this data set, I wanted to see if any of the individual measurements had any correlation with quality. Examining eleven input variables would require eleven plots such as this one:

ggplot (aes(x=quality, y = fixed.acidity), data=pf) + geom_point()

This plot is not easy to read. Because the output variable is a categorical variable, there is no continuity in the graph. A jitter in the X axis can help with this visualization. Transparency can help avoid overplotting at the same time:

ggplot (aes(x=quality, y = fixed.acidity), data=pf) + geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

While this plot is easier to read, it does risk giving the viewer a false sense that the X variable is a continuous variable.

Now that I have a strategy for plotting each of the eleven input variables against the output variable, I chose to plot each of the input variables against the output variable separately. Ideally, if I find a few variables that strongly correlate with the output variable, I’d like to find a way to present those in a single plot. But for my initial exploration I do want to look at each variable separately:

ggplot (aes(x=quality, y = fixed.acidity), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = volatile.acidity), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = citric.acid), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = residual.sugar), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = chlorides), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = free.sulfur.dioxide), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = total.sulfur.dioxide), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = density), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = pH), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = sulphates), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

ggplot (aes(x=quality, y = alcohol), data=pf) + 
  geom_point(alpha = .2, position = position_jitter(width = 0.5)) 

None of these plots was as predictive as I had hoped to find. I can see that wines with high chlorides tend to score poorly (or at least below average). That’s really the only relationship that I was able to pick up visually on these plots. I also see that the measurements for alcohol appear to be discrete measurements, again probably based on the limitations of the equipment used to do the measurements.
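
The eleven near-identical calls could also be generated in a loop. The sketch below assumes a ggplot2 version that supports the `.data` pronoun inside `aes()`; `toy` is a simulated stand-in for the wine data frame with only three input columns, and the names `input_vars` and `plots` are made up for this example.

```r
library(ggplot2)

# Simulated stand-in for the wine data frame (not the real measurements).
set.seed(42)
toy <- data.frame(
  quality   = sample(3:9, 200, replace = TRUE),
  alcohol   = runif(200, 8, 14),
  chlorides = runif(200, 0.01, 0.1),
  pH        = runif(200, 2.8, 3.8)
)

# Every column except the output variable is treated as an input.
input_vars <- setdiff(names(toy), "quality")

# Build one jittered, semi-transparent scatter plot per input variable.
plots <- lapply(input_vars, function(v) {
  ggplot(toy, aes(x = quality, y = .data[[v]])) +
    geom_point(alpha = 0.2, position = position_jitter(width = 0.5)) +
    ylab(v)
})
names(plots) <- input_vars

# Print a single plot, or loop over `plots` to print them all.
print(plots[["alcohol"]])
```

Keeping the plots in a named list makes it easy to change the jitter or transparency once and regenerate every plot.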

Let’s overlay a summary plot on top of the quality/alcohol plot.

ggplot (aes(x=quality, y = alcohol), data=pf) + 
  geom_point(alpha = .2, 
             position = position_jitter(width = 0.5),
             color = 'orange') +
  geom_line(stat = 'summary', fun.y = mean) 

This overlay shows something that was totally invisible in the scatterplot. We can take this a step further and include minimum and maximum alcohol values for each quality rating:

ggplot (aes(x=quality, y = alcohol), data=pf) + 
  geom_point(alpha = .2, 
             position = position_jitter(width = 0.5),
             color = 'orange') +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = min) +
  geom_line(stat = 'summary', fun.y = max)
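
The three summary lines are drawn from per-quality statistics that can also be tabulated directly, which is sometimes easier to read than the overlay. A minimal base R sketch, using a small invented data frame `toy` in place of the wine data:

```r
# Invented values for illustration; not taken from the wine data.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7, 7, 7),
  alcohol = c(9.0, 9.5, 10.0, 10.0, 11.0, 11.0, 12.0, 13.0)
)

# tapply() applies a function to alcohol within each quality group.
alcohol_min  <- tapply(toy$alcohol, toy$quality, min)
alcohol_mean <- tapply(toy$alcohol, toy$quality, mean)
alcohol_max  <- tapply(toy$alcohol, toy$quality, max)

summary_tbl <- data.frame(
  quality = as.integer(names(alcohol_mean)),
  min     = as.vector(alcohol_min),
  mean    = as.vector(alcohol_mean),
  max     = as.vector(alcohol_max)
)
print(summary_tbl)
```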

Next, I wanted to see if there was any relationship between the three measurements of acidity. For each wine, I am thinking about plotting volatile acidity, fixed acidity, and citric acid on the same plot. The naive way to do this is to use the wine’s label as the X axis, and the three measurements as separate lines:

ggplot(aes(x=X, y=citric.acid), data=pf) + geom_line() 

This creates a mess. Since the X dimension is an arbitrary value, there’s no way to see trends or relationships. Instead, I chose one of the three measurements (citric acid) and plotted the other two measurements against that one:

ggplot(aes(x = citric.acid, y = volatile.acidity), data = pf) + 
  geom_line()

ggplot(aes(x = citric.acid, y = fixed.acidity), data = pf) + 
  geom_line()

I don’t see enough of a trend here that would warrant one plot with both of these curves on it. Let’s lastly look at the relationship between fixed and volatile acidity:

ggplot(aes(x=fixed.acidity, y = volatile.acidity), data = pf) + 
  geom_line()

I think these three plots do a good job demonstrating that there’s no relationship between the three measurements of acidity. I found that a little surprising, but it’s hard to argue with those graphs. Since I was thinking about acidity and plots, I wanted to do a frequency plot showing fixed acidity, volatile acidity, and citric acid vs. pH. This is actually three lines on the same plot rather than a true frequency plot:

ggplot (data = pf, aes(x = pH)) + 
  # color is set outside aes() so the lines are drawn in the named colors
  geom_line(aes(y = fixed.acidity), color = "red") +
  geom_line (aes(y = volatile.acidity), color = "green") +
  geom_line(aes(y = citric.acid), color = "blue")

Because fixed acidity is so far off the other two measurements, I elected to scale the fixed acidity line (dividing it by 10) so that relationships between these variables can be better seen. When presenting this plot, it would have to be communicated to the viewer that the relationships are only relative. I imagine that, somewhere, ggplot has the ability to plot 3 lines with different scales on the same graph, but I was unable to figure out how to do this, so I cheated in order to make the visuals work.

ggplot (data = pf, aes(x = pH)) + 
  geom_line(aes(y = fixed.acidity / 10), color = 'red') +
  geom_line (aes(y = volatile.acidity), color = 'green') +
  geom_line(aes(y = citric.acid), color = 'blue' ) 

This plot is a mess and probably isn’t very useful as it is displayed. However, I think the concept that I’m trying to illustrate here is appropriate for an exploratory data analysis; I just don’t know enough about ggplot to coerce it into displaying what I’d like.
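
For two differently scaled series, one option is ggplot2's secondary axis, which labels the right-hand axis with a transformation of the left-hand one. It supports a single secondary axis, so it would not cover three independent scales. The sketch below is an assumption-laden illustration: `toy` is simulated, points are used instead of lines, and the factor of 10 simply mirrors the division used above.

```r
library(ggplot2)

# Simulated stand-in for the wine data frame (not the real measurements).
set.seed(7)
toy <- data.frame(
  pH               = runif(100, 2.8, 3.8),
  fixed.acidity    = runif(100, 4, 10),
  volatile.acidity = runif(100, 0.1, 0.6)
)

p <- ggplot(toy, aes(x = pH)) +
  # Fixed acidity is divided by 10 so it shares the primary axis range.
  geom_point(aes(y = fixed.acidity / 10), color = "red", alpha = 0.5) +
  geom_point(aes(y = volatile.acidity), color = "darkgreen", alpha = 0.5) +
  scale_y_continuous(
    name = "Volatile acidity (g/dm^3)",
    # The secondary axis multiplies by 10 to undo the scaling in its labels.
    sec.axis = sec_axis(~ . * 10, name = "Fixed acidity (g/dm^3)")
  )
print(p)
```

The viewer can then read each series against its own axis instead of being told that the comparison is only relative.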

The evaluator suggested that it would be nice to include more multivariate plots in my project. As seen in the previous plot, this is hard to do when different measurements are not on similar scales, or when different measurements are in different units. Looking at the description of the measurements, residual sugar, chlorides, and sulphates are all recorded in grams per cubic decimeter. Though I have no reason to believe these three are correlated, the fact that they’re all in the same units means that I can plot them on the same graph easily:

ggplot (data = pf, aes(x = X)) +
  geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides), color = 'green') +
  geom_line(aes(y = sulphates))

Similar to the plot of acidity, what we find here is that the chlorides (green) are at such a lower level than the sugars (red) that it becomes impossible to visually compare them. I decided to scale the chlorides by the ratio of the two medians (5.2 / 0.043) to make a visual comparison possible, which also means the Y-axis is no longer in a concrete unit.

ggplot (data = pf, aes(x = X)) +
  geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides * 5.2 / 0.043), color = 'green') +
  geom_line(aes(y = sulphates))

That’s looking better. Let’s do the same with the sulphates.

ggplot (data = pf, aes(x = X)) +
  geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides * 5.2 / 0.043), color = 'green') +
  geom_line(aes(y = sulphates * 5.2 / .4898))

Now all three lines are centered at roughly the same level, but it’s hard to make any meaningful observations. I first thought that I’d try to order the X axis so the residual sugar values go from least to most. That way, we can see if there are any corresponding trends in chlorides or sulphates. While that sounds nice, the better solution is probably to make the X axis residual sugar. At the same time we can make the lines less opaque to make visualization easier:

ggplot (data = pf, aes(x = residual.sugar)) +
  #geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides * 5.2 / 0.043), 
            color = 'green', alpha = 0.5) +
  geom_line(aes(y = sulphates * 5.2 / .4898), alpha = 0.5)

OK, now we can start seeing something. First, there’s more variation in chlorides than there is in sulphates. Second, there appear to be some wild outliers on the high end of residual sugar. Let’s limit the X axis:

ggplot (data = pf, aes(x = residual.sugar)) +
  #geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides * 5.2 / 0.043), color = 'green', alpha = 0.5) +
  geom_line(aes(y = sulphates * 5.2 / .4898), alpha = 0.5) +
  scale_x_continuous(limits = c(0, 20))
## Warning: Removed 18 rows containing missing values (geom_path).
## Warning: Removed 18 rows containing missing values (geom_path).

At this point, though the chlorides and sulphates lines are centered at roughly the same level, it’s still hard to make comparisons since the variations are so different. Let me change the coefficients to see if I can make this more visually accessible:

ggplot (data = pf, aes(x = residual.sugar)) +
  #geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides * 2 + .3), 
            color = 'green', alpha = 0.5) +
  geom_line(aes(y = sulphates), alpha = 0.5) +
  scale_x_continuous(limits = c(0, 20))
## Warning: Removed 18 rows containing missing values (geom_path).
## Warning: Removed 18 rows containing missing values (geom_path).
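
A more systematic alternative to hand-picked scale factors and offsets is to standardize each variable, so that every series has mean 0 and standard deviation 1. This is a swapped-in technique rather than what the plots above do, and `toy` below holds invented values purely for illustration.

```r
# Invented values for illustration; not taken from the wine data.
toy <- data.frame(
  chlorides = c(0.030, 0.040, 0.045, 0.050, 0.060),
  sulphates = c(0.35, 0.45, 0.50, 0.55, 0.65)
)

# scale() subtracts each column's mean and divides by its standard deviation.
z <- as.data.frame(scale(toy))

print(round(colMeans(z), 10))  # both columns centered at 0
print(apply(z, 2, sd))         # both columns have standard deviation 1
```

Standardized columns can share a Y axis labeled in standard deviations, which keeps the axis interpretable even though the original units are gone.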

I chose the scale and offset of the chlorides line by trial and error; it only took 2 or 3 iterations. What I have now is a graph that could show correlations between residual sugar, chlorides, and sulphates (if any existed). I really don’t see anything that looks like a correlation in here, but let’s add best fit lines just to see:

ggplot (data = pf, aes(x = residual.sugar)) +
  #geom_line(aes(y = residual.sugar), color = 'red') +
  geom_line(aes(y = chlorides * 2 + .3), color = 'green', alpha = 0.5) +
  geom_line(aes(y = sulphates), alpha = 0.5) +
  scale_x_continuous(limits = c(0, 20)) + 
  stat_smooth(method="lm", se=FALSE, mapping = aes(y = chlorides * 2 + .3), color = 'green') +
  stat_smooth(method="lm", se=FALSE, mapping = aes(y = sulphates))
## Warning: Removed 18 rows containing missing values (stat_smooth).
## Warning: Removed 18 rows containing missing values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_path).
## Warning: Removed 18 rows containing missing values (geom_path).

Just because we can draw a best fit line doesn’t mean that there’s any statistical significance to it. Let’s find the Pearson correlation coefficients (r) behind these two best fit lines; squaring r gives the R^2 of the corresponding line.

cor.test(pf$residual.sugar, pf$chlorides)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$residual.sugar and pf$chlorides
## t = 6.2299, df = 4896, p-value = 5.057e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06082916 0.11640188
## sample estimates:
##        cor 
## 0.08868454
cor.test(pf$residual.sugar, pf$sulphates)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$residual.sugar and pf$sulphates
## t = -1.8664, df = 4896, p-value = 0.06204
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.054630026  0.001343093
## sample estimates:
##         cor 
## -0.02666437

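It’s clear that the correlation coefficients here are pretty meaningless: r is about 0.089 for chlorides and about -0.027 for sulphates, so both R^2 values are below 0.01. The chlorides correlation is statistically significant (p = 5.057e-10), but with nearly 4,900 wines even a tiny correlation can be significant, and neither relationship explains a meaningful share of the variation.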
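
The link between the r reported by `cor.test` and the R^2 of a simple best fit line can be verified numerically. The sketch below uses simulated data with a deliberately weak linear relationship; the variable names are made up for the example.

```r
# Simulated data with a weak linear relationship (not the wine data).
set.seed(123)
x <- rnorm(500)
y <- 0.1 * x + rnorm(500)

r  <- cor(x, y)                     # Pearson's r
r2 <- summary(lm(y ~ x))$r.squared  # R^2 of the least-squares line

# For a one-predictor linear fit, r squared and R^2 coincide.
print(c(r = r, r_squared = r^2, lm_r_squared = r2))
```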

At the evaluator’s suggestion after my first submission, I plotted a boxplot of alcohol vs. quality:

qplot(x = factor(quality, ordered = TRUE), 
      y = alcohol, geom='boxplot', data = pf) 

This plot has some interesting properties. We can see that, in general, wines that are rated higher in quality tend to have higher alcohol content. This is particularly so with wines with the highest rating of 9. Wines with the highest rating also had fewer outliers. This is probably a function of the number of wines that received the top score more than anything:

table(pf$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

With only 5 wines that scored 9 on the quality scale, there’s just less opportunity for variation than there is for the 2,198 wines that scored a 6. (On the other hand, the 20 wines that scored a 3 have a spectacularly wide range of alcohol values, so fewer wines = less variation isn’t necessarily true.)

Though I initially did plots of quality vs other variables as scatter plots, perhaps plotting the same data as box plots might show something that wasn’t visible in the scatter plots. Let me return to the three measures of acidity that I was looking at earlier:

qplot(x = factor(quality, ordered = TRUE), 
      y = citric.acid, geom='boxplot', data = pf) 

qplot(x = factor(quality, ordered = TRUE), 
      y = fixed.acidity, geom='boxplot', data = pf) 

qplot(x = factor(quality, ordered = TRUE), 
      y = volatile.acidity, geom='boxplot', data = pf) 

These three plots show that the three measures of acidity are pretty consistent across quality. The medians and the limits of the first and third quartiles really don’t vary that much with quality, especially in relation to the overall ranges of these measures. I don’t think I’d invest any more time in exploring these variables.

Correlations

At this point I was a little discouraged that I didn’t find any relationships that appear worth further exploration. The next thing I wanted to do is look at all the correlations between the various attributes of the wines. I did this using the scatterplot matrix:

Unfortunately, there’s a lot of information crammed into a small space there, and it’s impossible to read this matrix. I was unable to find any documentation that would explain how to make this plot more readable. I was hoping that finding a high correlation coefficient or two would help me find additional relationships between variables that I could explore.
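
When the scatterplot matrix is too dense to read, the same information can be reduced to a ranked table of pairwise correlations. The sketch below is one possible approach using base R; `toy` is simulated so that one pair is strongly correlated by construction, and `pairs_tbl` is a name invented for this example.

```r
# Simulated data (not the wine data): b is built from a, c is pure noise.
set.seed(2024)
n <- 300
toy <- data.frame(a = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.5)
toy$c <- rnorm(n)

# Full matrix of Pearson correlations between all columns.
cm <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl)
```

Applied to the numeric columns of the wine data, the top rows of such a table would point to the variable pairs most worth plotting.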

Final Plots

The first plot I want to highlight is the plot of pH vs. fixed acidity. I chose this plot because it is the one plot I found that has a definite, visually observable correlation, and therefore is a plot that would be suitable for discussion, analysis, or exposition in a presentation. The plot clearly shows that a lower pH is associated with higher fixed acidity.

ggplot(aes(x=pH, y=fixed.acidity), data=pf) + 
  geom_point(alpha = 0.3, color='goldenrod') + 
  stat_smooth(method="lm", se=FALSE) +
  ylab("Fixed Acidity (g/dm^3)") + 
  ggtitle("Fixed Acidity vs. pH of Tested Wines")

The second plot is a box plot that shows the alcohol content of wines for each quality rank.

ggplot(data = pf, aes(x = factor(quality, ordered = TRUE), y = alcohol)) +
  #add a boxplot with color fill
  geom_boxplot(aes(fill = factor(quality, ordered = TRUE))) +
  #add colored dots
  geom_point(aes(color = factor(quality, ordered = TRUE))) +
  #label the x and y axes
  ylab("Alcohol Content") +
  xlab("Wine Quality Rating") +
  #set a title for the plot
  ggtitle("Comparison of Alcohol Content vs. Wine Quality") +
  #assign colors to the fills and dots
  scale_fill_manual(values = c("blue", "green", "red", "cyan", "orange", "darkblue", "darkgreen")) +
  scale_color_manual(values = c("blue", "green", "red", "cyan", "orange", "darkblue", "darkgreen")) +
  #remove the (redundant) legend
  theme(legend.position = "none")

This plot shows the variations in alcohol content of wines of various ratings. You can see a clear upward trend of alcohol content from wines rated 5 to wines rated 9.

My third plot is similar to the second, but shows slightly different information. Here, I have overlaid the minimum, mean, and maximum alcohol values for each wine quality rating:

ggplot (aes(x=quality, y = alcohol), data=pf) + 
  geom_point(alpha = .4, 
             position = position_jitter(width = 0.5),
             color = 'orange') +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = min, color = "green") +
  geom_line(stat = 'summary', fun.y = max, color = "red") +
  xlab ("Wine Quality") +
  ylab ("Alcohol Content (% by volume)") +
  ggtitle ("Minimum, Mean, and Maximum Alcohol Content by Wine Quality Rating")

Reflection

This project was more involved than I thought it would be. The first thing that struck me about this project is that since the output variable ‘Quality’ is a categorical variable, it made it difficult to come up with good visualizations to show how each of the input variables affects the output variable.

I also found myself confused about some of the plot types. For example, the box plot is introduced in the single-variable lesson, but the box plot actually requires both an X and a Y variable, which I interpret to mean this is actually a two-variable plot.

My inability to find variables that showed a clear correlation to each other was frustrating. I wanted to dive deeper into a relationship or two but was unable to find any other than the relationship between fixed acidity and pH, which honestly isn’t a very revealing relationship to anyone who has had high school chemistry.

The dearth of categorical variables in the data set made it difficult to use some of the techniques illustrated in the lectures. For instance, faceting requires a categorical variable over which to facet the data.
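
Two workarounds may be worth trying here: faceting on the integer quality score itself, since it takes only a handful of distinct values, or binning a continuous variable into a factor with `cut()`. The sketch below is illustrative only; `toy` is simulated, and the break points and the `alcohol.bucket` column name are assumptions made for the example.

```r
library(ggplot2)

# Simulated stand-in for the wine data frame (not the real measurements).
set.seed(99)
toy <- data.frame(
  quality = sample(5:7, 300, replace = TRUE),
  alcohol = runif(300, 8, 14)
)

# Facet on the integer score: one histogram panel per quality value.
p <- ggplot(toy, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ quality)
print(p)

# A continuous variable can also be binned into a factor with cut().
toy$alcohol.bucket <- cut(toy$alcohol,
                          breaks = c(8, 10, 12, 14),
                          include.lowest = TRUE)
print(table(toy$alcohol.bucket))
```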